The dataset I have chosen consists of Airbnb listing data from San Francisco collected on 28 different dates between November 2013 and July 2017. I originally downloaded the datasets from http://tomslee.net/airbnb-data-collection-get-the-data as individual files for each date, which I then combined into a single file. For more details on this, see union_script.R in the p6 folder.
For details on the content of the columns, see http://tomslee.net/airbnb-data-collection-get-the-data.
I will start my analysis by taking a look at some records, to get an initial feel of the dataset.
## room_id host_id room_type borough neighborhood reviews
## 1 8014 22402 Private room NA Outer Mission 15
## 2 10832 38836 Entire home/apt NA Downtown/Civic Center 2
## 3 26488 112300 Entire home/apt NA Financial District 48
## 4 45900 204441 Entire home/apt NA Financial District 158
## 5 54518 255971 Entire home/apt NA Financial District 5
## 6 56489 40784 Entire home/apt NA South of Market 2
## overall_satisfaction accommodates bedrooms price minstay latitude
## 1 4.5 1 4 49 2 37.73075
## 2 5.0 4 1 172 30 37.78590
## 3 5.0 2 1 1097 360 37.79090
## 4 4.5 6 3 219 1 37.78858
## 5 4.5 2 1 187 30 37.79330
## 6 5.0 4 2 1099 30 37.78915
## longitude last_modified date_collected
## 1 -122.4484 2013-12-08 14:09:52 2013-11-17
## 2 -122.4083 2013-12-08 18:54:01 2013-11-17
## 3 -122.3933 2013-12-07 23:36:57 2013-11-17
## 4 -122.4048 2013-12-07 03:09:04 2013-11-17
## 5 -122.4008 2013-12-07 03:32:39 2013-11-17
## 6 -122.3895 2013-12-07 01:48:46 2013-11-17
## room_id host_id room_type borough neighborhood reviews
## 205527 13886867 19206511 Private room NA Ocean View 8
## 205528 15481646 160490 Private room NA Haight Ashbury 0
## 205529 14268636 24091826 Private room NA Excelsior 49
## 205530 9033827 37570945 Private room NA Excelsior 3
## 205531 15605282 5020080 Private room NA Inner Richmond 12
## 205532 18089644 124497978 Private room NA South of Market 0
## overall_satisfaction accommodates bedrooms price minstay latitude
## 205527 4.5 2 1 40 NA 37.71325
## 205528 0.0 1 1 10 NA 37.76497
## 205529 5.0 2 1 39 NA 37.72478
## 205530 4.5 1 1 40 NA 37.72581
## 205531 5.0 2 1 38 NA 37.78226
## 205532 0.0 1 1 10 NA 37.77123
## longitude last_modified date_collected
## 205527 -122.4582 46:31.1 2017-07-10
## 205528 -122.4520 46:31.1 2017-07-10
## 205529 -122.4319 46:31.1 2017-07-10
## 205530 -122.4038 46:31.1 2017-07-10
## 205531 -122.4778 46:31.1 2017-07-10
## 205532 -122.4042 46:31.1 2017-07-10
Next I’ll get some information about the data types.
## 'data.frame': 205532 obs. of 15 variables:
## $ room_id : int 8014 10832 26488 45900 54518 56489 64332 70284 70753 71370 ...
## $ host_id : int 22402 38836 112300 204441 255971 40784 40784 329072 329072 364983 ...
## $ room_type : Factor w/ 4 levels "","Entire home/apt",..: 3 2 2 2 2 2 2 4 4 2 ...
## $ borough : logi NA NA NA NA NA NA ...
## $ neighborhood : Factor w/ 37 levels "Bayview","Bernal Heights",..: 22 7 9 9 9 32 32 4 9 22 ...
## $ reviews : int 15 2 48 158 5 2 14 8 70 10 ...
## $ overall_satisfaction: num 4.5 5 5 4.5 4.5 5 4 4.5 4.5 4.5 ...
## $ accommodates : int 1 4 2 6 2 4 6 1 4 6 ...
## $ bedrooms : int 4 1 1 3 1 2 2 4 1 2 ...
## $ price : int 49 172 1097 219 187 1099 350 27 30 131 ...
## $ minstay : int 2 30 360 1 30 30 2 30 1 3 ...
## $ latitude : num 37.7 37.8 37.8 37.8 37.8 ...
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ last_modified : Factor w/ 189810 levels "00:00.4","00:02.4",..: 1959 1986 1914 1832 1838 1819 1886 1934 1805 1999 ...
## $ date_collected : Factor w/ 28 levels "2013-11-17","2014-05-11",..: 1 1 1 1 1 1 1 1 1 1 ...
# Convert date_collected from factor to Date
airbnb$date_collected <- as.Date(airbnb$date_collected)
I now want to get some summary statistics for the different fields in the dataframe.
## room_id host_id room_type
## Min. : 958 Min. : 46 : 43
## 1st Qu.: 2433743 1st Qu.: 2597104 Entire home/apt:121681
## Median : 6988943 Median : 8539143 Private room : 76404
## Mean : 7020337 Mean : 18201785 Shared room : 7404
## 3rd Qu.:10772374 3rd Qu.: 25805964
## Max. :19781990 Max. :139553832
## NA's :6
## borough neighborhood reviews
## Mode:logical Mission : 25487 Min. : 0.00
## NA's:205532 Western Addition : 20223 1st Qu.: 1.00
## South of Market : 15860 Median : 5.00
## Castro/Upper Market : 12098 Mean : 21.07
## Downtown/Civic Center: 11404 3rd Qu.: 22.00
## Haight Ashbury : 10524 Max. :513.00
## (Other) :109936 NA's :49
## overall_satisfaction accommodates bedrooms price
## Min. :0.00 Min. : 1.000 Min. : 0.000 Min. : 0
## 1st Qu.:4.50 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 108
## Median :5.00 Median : 2.000 Median : 1.000 Median : 167
## Mean :3.96 Mean : 3.082 Mean : 1.346 Mean : 252
## 3rd Qu.:5.00 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 256
## Max. :5.00 Max. :18.000 Max. :10.000 Max. :30000
## NA's :46354 NA's :8421 NA's :10690
## minstay latitude longitude
## Min. : 1.00 Min. :37.71 Min. :-122.5
## 1st Qu.: 1.00 1st Qu.:37.75 1st Qu.:-122.4
## Median : 2.00 Median :37.77 Median :-122.4
## Mean : 3.53 Mean :37.77 Mean :-122.4
## 3rd Qu.: 3.00 3rd Qu.:37.79 3rd Qu.:-122.4
## Max. :1000.00 Max. :37.83 Max. :-122.4
## NA's :71477
## last_modified date_collected
## 2015-08-21 16:54:47.397989: 3974 Min. :2013-11-17
## 48:03.8 : 24 1st Qu.:2016-02-17
## 25:46.6 : 18 Median :2016-07-17
## 49:41.8 : 18 Mean :2016-07-02
## 40:27.5 : 17 3rd Qu.:2017-01-14
## 43:16.5 : 17 Max. :2017-07-10
## (Other) :201464
I am already seeing some trends in the data. For example, based on the 1st quartile and max value for overall_satisfaction, I suspect that very few listings have ratings below 4 out of 5. Let’s plot this to take a closer look.
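The rating distribution behind such a plot can also be tabulated as percentage shares. The following is a minimal sketch on a made-up `ratings` vector (the values are invented for illustration and are not taken from the dataset):

```r
# Toy ratings vector (made-up values); NA marks listings without a rating
ratings <- c(5, 5, 4.5, 0, 5, 4, NA, 3.5, 5, 0)

# Share of each rating among the non-missing values, in percent
# (table() drops NA by default)
rating_share <- round(100 * prop.table(table(ratings)), 1)
print(rating_share)

# Share of non-missing ratings that are 4 or higher, in percent
share_4_plus <- 100 * mean(ratings >= 4, na.rm = TRUE)
print(share_4_plus)
```

Because `table()` ignores missing values here, the shares are relative to rated listings only, which is the denominator that matters when reading a rating histogram.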
The plot confirms my suspicion: 81.9% of the ratings are 4 or higher, with more than 50% of the ratings being 5. However, it is interesting to see that 16.8% of the rooms have a rating of 0. Let’s take a closer look at some summary statistics for these records to see if there’s a data quality issue.
## room_id host_id room_type
## Min. : 6810 Min. : 316 : 0
## 1st Qu.: 8307860 1st Qu.: 4746287 Entire home/apt:15508
## Median :10948458 Median : 16570082 Private room :10570
## Mean :11360472 Mean : 28935038 Shared room : 597
## 3rd Qu.:15529178 3rd Qu.: 44780730
## Max. :19781990 Max. :139553832
##
## neighborhood reviews overall_satisfaction
## Mission : 2854 Min. :0.0000 Min. :0
## Western Addition : 2483 1st Qu.:0.0000 1st Qu.:0
## South of Market : 2475 Median :0.0000 Median :0
## Downtown/Civic Center: 2089 Mean :0.5638 Mean :0
## Haight Ashbury : 1322 3rd Qu.:1.0000 3rd Qu.:0
## Bernal Heights : 1285 Max. :6.0000 Max. :0
## (Other) :14167
## accommodates bedrooms price minstay
## Min. : 1.000 Min. : 0.000 Min. : 10.0 Min. : NA
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 100.0 1st Qu.: NA
## Median : 2.000 Median : 1.000 Median : 180.0 Median : NA
## Mean : 3.202 Mean : 1.375 Mean : 309.2 Mean :NaN
## 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 300.0 3rd Qu.: NA
## Max. :16.000 Max. :10.000 Max. :30000.0 Max. : NA
## NA's :26675
## latitude longitude last_modified date_collected
## Min. :37.71 Min. :-122.5 45:06.5: 15 Min. :2016-12-23
## 1st Qu.:37.76 1st Qu.:-122.4 25:57.8: 14 1st Qu.:2017-01-14
## Median :37.77 Median :-122.4 25:28.6: 13 Median :2017-03-12
## Mean :37.77 Mean :-122.4 47:37.9: 13 Mean :2017-03-17
## 3rd Qu.:37.79 3rd Qu.:-122.4 25:46.7: 12 3rd Qu.:2017-04-08
## Max. :37.83 Max. :-122.4 40:32.5: 12 Max. :2017-07-10
## (Other):26596
The first thing I notice is that there seems to be a large number of records with 0 reviews. According to the dataset description the overall_satisfaction consists of “The average rating (out of five) that the listing has received from those visitors who left a review.” We can therefore assume that records with 0 reviews should have rating set to NA, not 0-5. Let’s take a look at how many records have 0 reviews and an overall_satisfaction between 0 and 5.
## [1] 15945
About 8% of the records match these criteria. Let’s plot the overall_satisfaction for these records.
It looks like all zero-review records with a value in overall_satisfaction have it set to 0. I’ll run a filtered query to make sure.
## room_id host_id room_type neighborhood reviews overall_satisfaction
## 1 1097480 4955917 Private room Outer Richmond 0 3
## 2 1097480 4955917 Private room Outer Richmond 0 3
## 3 1097480 4955917 Private room Outer Richmond 0 3
## accommodates bedrooms price minstay latitude longitude
## 1 1 1 123 5 37.77496 -122.5009
## 2 1 1 169 5 37.77496 -122.5009
## 3 1 1 192 5 37.77496 -122.5009
## last_modified date_collected
## 1 2014-05-11 23:15:02.110 2014-05-11
## 2 2014-08-24 23:50:05.640 2014-08-24
## 3 2015-02-19 11:15:41.282 2015-02-19
3 out of 205k records is practically 0. I will now plot the overall_satisfaction column again excluding the 0 reviews records. I will also update the airbnb dataframe and change the overall_satisfaction score from 0 to NA for these records.
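The check and the recoding described above could look roughly like the sketch below. It runs on a made-up `listings` data frame, and it assumes that every rated zero-review record is recoded to NA (including the few non-zero ones); the exact filter used in the original analysis is not shown in this report.

```r
# Toy listings (made-up values): a rating on a listing with 0 reviews
# is treated as a placeholder, not a real average
listings <- data.frame(
  reviews = c(0, 0, 3, 1, 0, 12),
  overall_satisfaction = c(0, 3, 4.5, 0, NA, 5)
)

# Rows with zero reviews that nevertheless carry a rating
no_reviews_rated <- !is.na(listings$reviews) & listings$reviews == 0 &
  !is.na(listings$overall_satisfaction)
sum(no_reviews_rated)

# Of those, the rows whose rating is not simply 0
subset(listings, no_reviews_rated & overall_satisfaction != 0)

# Recode: a listing without reviews cannot have an average rating
listings$overall_satisfaction[no_reviews_rated] <- NA
```

Rows with at least one review and a rating of 0 are deliberately left untouched, since they need a separate look.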
When excluding the zero-review records, the percentage of records with an overall rating of 0 goes down to 7.5. That’s still a fair amount, but it’s much more believable than before. Let’s run a summary query on these records to see if we can spot any trends.
## room_id host_id room_type
## Min. : 6810 Min. : 316 : 0
## 1st Qu.: 7764330 1st Qu.: 3990783 Entire home/apt:6115
## Median :11377038 Median : 13488074 Private room :4398
## Mean :11053131 Mean : 26404572 Shared room : 220
## 3rd Qu.:15222764 3rd Qu.: 39631194
## Max. :19592343 Max. :137719050
##
## neighborhood reviews overall_satisfaction
## Mission :1273 Min. :1.000 Min. :0
## South of Market : 917 1st Qu.:1.000 1st Qu.:0
## Western Addition : 913 Median :1.000 Median :0
## Downtown/Civic Center: 861 Mean :1.401 Mean :0
## Bernal Heights : 574 3rd Qu.:2.000 3rd Qu.:0
## Haight Ashbury : 511 Max. :6.000 Max. :0
## (Other) :5684
## accommodates bedrooms price minstay
## Min. : 1.000 Min. : 0.000 Min. : 10.0 Min. : NA
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: NA
## Median : 2.000 Median : 1.000 Median : 155.0 Median : NA
## Mean : 3.074 Mean : 1.322 Mean : 228.1 Mean :NaN
## 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 250.0 3rd Qu.: NA
## Max. :16.000 Max. :10.000 Max. :10000.0 Max. : NA
## NA's :10733
## latitude longitude last_modified date_collected
## Min. :37.71 Min. :-122.5 29:05.3: 7 Min. :2016-12-23
## 1st Qu.:37.76 1st Qu.:-122.4 46:58.0: 7 1st Qu.:2017-01-14
## Median :37.77 Median :-122.4 29:06.9: 6 Median :2017-03-12
## Mean :37.77 Mean :-122.4 40:33.6: 6 Mean :2017-03-16
## 3rd Qu.:37.79 3rd Qu.:-122.4 46:59.7: 6 3rd Qu.:2017-04-08
## Max. :37.83 Max. :-122.4 47:32.0: 6 Max. :2017-07-10
## (Other):10695
What stands out to me is that a large portion of these records have only one review. Let’s compare the number of reviews of these records with the full dataset.
Almost all the listings with overall_satisfaction score 0 have either 1 or 2 reviews. Next let’s take a look at the price distribution.
It looks like there are some extreme outliers for the price variable. Let’s zoom in to get a better sense of the price distribution.
## 95% 96% 97% 98% 99% 100%
## 650 750 900 1000 1500 30000
The 98th price percentile is at 1000, and I’ll use that as the price cutoff to zoom in on the data.
Setting the max price to 1000, which excludes the 2% most expensive units, gives a clearer picture of the price distribution. The plot is heavily right-skewed, with most units being priced below 250.
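The percentile table and the zoom cutoff can be reproduced with `quantile()`. This is a minimal sketch on a made-up `price` vector; whether the original plot filtered the data or only limited the axis is not stated, so the filtering step is an assumption.

```r
# Toy prices (made-up values) with one extreme outlier
price <- c(80, 95, 110, 120, 150, 167, 200, 256, 400, 30000)

# Upper percentiles from the 95th to the 100th
upper <- quantile(price, probs = seq(0.95, 1, by = 0.01), na.rm = TRUE)
print(upper)

# Use the 98th percentile as the cutoff and keep prices at or below it
cutoff <- quantile(price, probs = 0.98, na.rm = TRUE)
zoomed <- price[!is.na(price) & price <= cutoff]
print(zoomed)
```

Deriving the cutoff from a percentile rather than hard-coding it keeps the zoom level meaningful if the data is refreshed.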
I suspect the size of a unit heavily affects its price. Therefore I will make a new column for price per bedroom.
# Load dplyr for %>%, group_by() and summarise()
library(dplyr)
# Creating the column
airbnb$price_per_bedroom <- airbnb$price / airbnb$bedrooms
# Replacing infinite values (a positive price divided by 0 bedrooms) with NA
airbnb$price_per_bedroom[is.infinite(airbnb$price_per_bedroom)] <- NA
# Average price and average price per bedroom for each room type
airbnb %>%
  group_by(room_type) %>%
  summarise(n = n(), avg_price = mean(price, na.rm = TRUE),
            avg_bedroom_price = mean(price_per_bedroom, na.rm = TRUE))
## # A tibble: 4 x 4
## room_type n avg_price avg_bedroom_price
## <fctr> <int> <dbl> <dbl>
## 1 43 162.60465 NaN
## 2 Entire home/apt 121681 336.69836 205.42191
## 3 Private room 76404 132.44443 131.93803
## 4 Shared room 7404 95.03025 93.30906
Price_per_bedroom percentiles:
## 95% 96% 97% 98% 99% 100%
## 375 400 491 550 800 28000
Unsurprisingly, the trend is very similar for the price and price_per_bedroom plots: the distribution is heavily right-skewed.
Almost two thirds of all units are one-bedroom rentals. Room type is probably a factor here. I will revisit this in the bivariate plots section.
Almost all units have room_type either Entire home/apt or Private room. Later on it will be interesting to see if this changes over time.
Minstay percentiles:
## 95% 96% 97% 98% 99% 100%
## 10 14 28 30 30 1000
It looks like most Airbnb hosts require a minimum stay between 1 and 3 nights, with few units requiring more than 7 nights. It is interesting to see that many more units have a minimum stay of 7 and 30 nights than 6 nights, which is natural considering there are 7 days in a week, and (roughly) 30 days in a month. There is also a slight increase at 10 days (round number) and 14 days (2 weeks).
The neighborhood distribution is very uneven, with some very large and some very small neighborhoods. Although it is outside the scope of this project, it would be interesting to compare the neighborhood proportions in the Airbnb dataset to population and housing data from San Francisco. This could reveal whether being an Airbnb host is much more common in some neighborhoods than in others.
While it is most common for units to accommodate 2 people, accommodating 4 people is also rather common. Few listings accommodate more than 6 people.
There are 205,532 Airbnb listings in the dataset. The listings were collected on 28 different dates between November 2013 and July 2017. Each record consists of 6 numeric, 2 nominal, and 2 ID variables (room_id and host_id). There are also 2 date columns and 2 columns with geographical location (longitude and latitude).
From my initial analysis I consider price and date to be the main features of interest in the data set.
Price variations across neighborhoods will be interesting to take a look at. I believe number of bedrooms and room type are other interesting parameters which will greatly affect the price.
Yes, I created price per bedroom, as that seemed to be the best indicator of size. Without taking unit size into account it is very hard to get a true price picture.
It is a little surprising to me how large the portion of 1 bedroom apartments is. More than half of the units on the market fit into this category. There’s a chance some of these records really inform about the number of rooms available, not the actual amount of rooms in the unit. It will be interesting to later take a look at the bedroom distribution across room type.
Yes. I initially downloaded 28 separate files which I merged to one file using the tidy function union. For full details about this process, see https://github.com/gisledb/udacity_nanodegree/blob/master/p6_data_visualization_with_Tableau/Project/union_script.R.
In the file you are reading now I have changed the data type of the date_collected column from factor to date. I also removed one “dead” variable, borough, as all the values were set to NA. During my analysis I found a data error which I corrected: 15,945 records with 0 reviews did not have overall_satisfaction set to NA, and >99.9% of these records had overall_satisfaction set to 0. I changed the overall_satisfaction of these records to NA.
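As a rough illustration of the tidying steps described above, the sketch below stacks per-date files and drops columns that are entirely NA. It is not the author's union_script.R: it uses base R `rbind` instead of `union`, and the file names (`sf_<date>.csv`), the `airbnb_demo` folder and the toy values are all hypothetical.

```r
# Write two tiny per-date CSV files to a temporary folder
# (made-up stand-ins for the downloaded survey files)
data_dir <- file.path(tempdir(), "airbnb_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(room_id = c(1, 2), borough = NA, price = c(100, 150)),
          file.path(data_dir, "sf_2013-11-17.csv"), row.names = FALSE)
write.csv(data.frame(room_id = c(2, 3), borough = NA, price = c(155, 90)),
          file.path(data_dir, "sf_2014-05-11.csv"), row.names = FALSE)

# Read one file and tag its rows with the date taken from the file name
read_one <- function(path) {
  df <- read.csv(path)
  df$date_collected <- as.Date(sub("^sf_(.*)\\.csv$", "\\1", basename(path)))
  df
}

# Stack all per-date files into a single data frame
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
airbnb_demo <- do.call(rbind, lapply(files, read_one))

# Drop columns that contain nothing but NA (borough in this toy example)
all_na <- vapply(airbnb_demo, function(column) all(is.na(column)), logical(1))
airbnb_demo <- airbnb_demo[, !all_na, drop = FALSE]
print(airbnb_demo)
```

Unlike `union`, `rbind` keeps exact duplicate rows, so a deduplication step would be needed if the same record could appear in more than one file.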
In this section, among other things I want to take a closer look at the relationship between price and neighborhood, and price and time.
Before I go any further, I want to make sure the neighborhood data is correct. Since the dataset contains latitude and longitude data, this is fairly easy to check using a map plot.
This is the result I was hoping for. The data points are nicely clustered within their respective neighborhoods. Based on local knowledge I can confirm that the names of the neighborhoods are correct.
## # A tibble: 37 x 6
## neighborhood n mean med min max
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Presidio Heights 1039 494.3898 200 49 10000
## 2 Presidio 155 439.9355 195 60 1500
## 3 Russian Hill 5925 355.0160 212 31 9995
## 4 Pacific Heights 5645 351.4376 220 10 9900
## 5 Marina 7670 322.4735 225 20 8964
## 6 Financial District 3313 312.1150 180 10 28000
## 7 South of Market 15860 303.3232 170 0 30000
## 8 Chinatown 2781 291.6559 179 19 10000
## 9 Potrero Hill 6720 287.4071 190 36 9000
## 10 North Beach 4169 277.6697 195 10 9999
## # ... with 27 more rows
Presidio Heights and Presidio are clearly the most expensive neighborhoods, with an average price per night of roughly $440 or more. The most affordable neighborhoods are Treasure Island, Crocker Amazon and Lakeshore. This makes sense, as they are all on the outskirts of the San Francisco city limits. Based on my local knowledge of the city there are no surprises in this figure.
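A table like the one above could be produced with a grouped summary sorted by mean price. This is a sketch with dplyr on a made-up `listings` data frame; the column names mirror the table, but the exact call used in the original analysis is not shown in this report.

```r
library(dplyr)

# Toy listings (made-up values); the real data has 37 neighborhoods
listings <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B", "C"),
  price = c(100, 200, 900, 80, 120, NA)
)

# One row per neighborhood, most expensive (by mean price) first
neighborhood_prices <- listings %>%
  filter(!is.na(price)) %>%
  group_by(neighborhood) %>%
  summarise(n = n(), mean = mean(price), med = median(price),
            min = min(price), max = max(price)) %>%
  arrange(desc(mean))

print(neighborhood_prices)
```

Swapping `desc(mean)` for `desc(med)` gives the median-based ranking, which is less sensitive to a handful of very expensive listings.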
When we compare the price to price per bedroom, Presidio Heights is no longer the most expensive neighborhood. Now it seems like Downtown and the Financial District have the most expensive units. One thing to note is that this plot excludes any units with zero bedrooms (roughly 13% of the records). Let’s see if looking at median instead of mean changes things.
Looking at median instead of mean, the situation changes quite a bit. For absolute price, Presidio Heights and Presidio drop to 4th and 5th places, while Marina, Pacific Heights and Russian Hill now occupy the top 3. Presidio has the highest price per bedroom, with Marina and Chinatown having the 2nd and 3rd highest bedroom price.
Comparing the mean and median plots, it seems like some neighborhoods have some very expensive units, increasing the mean values. Presidio and Presidio Heights in particular seem to be affected by this. It doesn’t seem to be the same case the other way around, as all neighborhoods seem to have a higher mean than median price. Let’s create another plot to confirm this.
As expected, no neighborhood has a higher median price than mean price.
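The same check can be made numerically by comparing the mean and median per group. This is a minimal base R sketch on made-up vectors (`price` and `neighborhood` below are invented for illustration):

```r
# Toy data (made-up values): neighborhood A contains one very expensive unit
price <- c(100, 200, 900, 80, 120, 190)
neighborhood <- c("A", "A", "A", "B", "B", "B")

mean_price <- tapply(price, neighborhood, mean)
median_price <- tapply(price, neighborhood, median)

# TRUE if no neighborhood has a median above its mean
all(mean_price >= median_price)

# A ratio well above 1 points to right skew or large outliers
ratio <- mean_price / median_price
print(round(ratio, 2))
```

Sorting `ratio` in decreasing order highlights the neighborhoods whose mean is most inflated by expensive listings.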
In this plot, which compares median and mean price per neighborhood, we see that the mean price in Presidio Heights is almost 2.5 times the median price. From the univariate plot section I already know that the dataset has some large price outliers. A box plot might reveal whether the outliers are distributed across most or only a few neighborhoods.
The outliers are so dominant that it is hard to recognize this as a box plot. Let’s exclude the top percentile and run the box plots again.
Even after excluding the most extreme (top 1%) prices most neighborhoods seem to have a lot of large outliers. Many neighborhoods have a relatively large spread of prices, and the prices for most neighborhoods are right-skewed.
Except for one outlier date in 2015, the mean price has stayed fairly consistent, roughly between 210 and 270, over the whole time period in the dataset (late 2013 to mid-2017). This makes me curious about whether there was a special event in San Francisco at the time the data was collected which increased the prices dramatically, or if this is due to some outliers. Let’s see if the median price follows the same trend.
At first glance the median price appears to be much more volatile than the mean price, but this is mainly due to the y axis being much narrower in this last plot. Let’s plot the statistics together to get a clearer picture.
Here we see that median price in general follows the trend of the mean price. One interesting exception is the outlier date from the mean chart, which is not an outlier for median price. To me this indicates that the 2015 outlier date has some very high outlier prices. Let’s investigate this further.
There is not an unusual number of records for any date in 2015, so record count is probably not a relevant factor. Next I’ll take a closer look at the data for the mean price outlier date.
## # A tibble: 6 x 6
## date_collected n mean median min max
## <date> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2015-08-21 5140 481.0601 179 10 28000
## 2 2016-02-17 8549 268.1229 175 0 10000
## 3 2016-06-18 7783 265.9597 174 10 10000
## 4 2016-09-17 8076 263.7397 175 1 10000
## 5 2016-10-19 8236 262.5647 175 1 10000
## 6 2016-04-15 8051 260.7369 170 10 10000
2015-08-21 is the outlier date.
Running the summary() function on all the records:
## room_id host_id room_type
## Min. : 958 Min. : 46 : 43
## 1st Qu.: 2433743 1st Qu.: 2597104 Entire home/apt:121681
## Median : 6988943 Median : 8539143 Private room : 76404
## Mean : 7020337 Mean : 18201785 Shared room : 7404
## 3rd Qu.:10772374 3rd Qu.: 25805964
## Max. :19781990 Max. :139553832
## NA's :6
## neighborhood reviews overall_satisfaction
## Mission : 25487 Min. : 0.00 Min. :0.0
## Western Addition : 20223 1st Qu.: 1.00 1st Qu.:4.5
## South of Market : 15860 Median : 5.00 Median :5.0
## Castro/Upper Market : 12098 Mean : 21.07 Mean :4.4
## Downtown/Civic Center: 11404 3rd Qu.: 22.00 3rd Qu.:5.0
## Haight Ashbury : 10524 Max. :513.00 Max. :5.0
## (Other) :109936 NA's :49 NA's :62326
## accommodates bedrooms price minstay
## Min. : 1.000 Min. : 0.000 Min. : 0 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 108 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 167 Median : 2.00
## Mean : 3.082 Mean : 1.346 Mean : 252 Mean : 3.53
## 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 256 3rd Qu.: 3.00
## Max. :18.000 Max. :10.000 Max. :30000 Max. :1000.00
## NA's :8421 NA's :10690 NA's :71477
## latitude longitude last_modified
## Min. :37.71 Min. :-122.5 2015-08-21 16:54:47.397989: 3974
## 1st Qu.:37.75 1st Qu.:-122.4 48:03.8 : 24
## Median :37.77 Median :-122.4 25:46.6 : 18
## Mean :37.77 Mean :-122.4 49:41.8 : 18
## 3rd Qu.:37.79 3rd Qu.:-122.4 40:27.5 : 17
## Max. :37.83 Max. :-122.4 43:16.5 : 17
## (Other) :201464
## date_collected price_per_bedroom
## Min. :2013-11-17 Min. : 0.0
## 1st Qu.:2016-02-17 1st Qu.: 95.0
## Median :2016-07-17 Median : 130.0
## Mean :2016-07-02 Mean : 171.5
## 3rd Qu.:2017-01-14 3rd Qu.: 189.0
## Max. :2017-07-10 Max. :28000.0
## NA's :26169
Running the summary() function on the 2015-08-21 records:
## room_id host_id room_type
## Min. : 5193 Min. : 46 : 0
## 1st Qu.:1334904 1st Qu.: 1622704 Entire home/apt:3027
## Median :3578486 Median : 5488994 Private room :1879
## Mean :3692542 Mean : 9725157 Shared room : 234
## 3rd Qu.:6062615 3rd Qu.:14061126
## Max. :7983070 Max. :42076683
##
## neighborhood reviews overall_satisfaction
## Mission : 696 Min. : 0.00 Min. :1.00
## Western Addition : 544 1st Qu.: 1.00 1st Qu.:4.50
## South of Market : 369 Median : 7.00 Median :5.00
## Castro/Upper Market: 340 Mean : 21.22 Mean :4.74
## Haight Ashbury : 286 3rd Qu.: 26.00 3rd Qu.:5.00
## Bernal Heights : 267 Max. :371.00 Max. :5.00
## (Other) :2638 NA's :49 NA's :945
## accommodates bedrooms price minstay
## Min. : 1.000 Min. : 0.00 Min. : 10.0 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 124.8 1st Qu.: 1.000
## Median : 2.000 Median : 1.00 Median : 179.0 Median : 2.000
## Mean : 2.583 Mean : 1.33 Mean : 481.1 Mean : 4.938
## 3rd Qu.: 4.000 3rd Qu.: 2.00 3rd Qu.: 285.0 3rd Qu.: 3.000
## Max. :16.000 Max. :10.00 Max. :28000.0 Max. :365.000
## NA's :650 NA's :10 NA's :318
## latitude longitude last_modified
## Min. :37.71 Min. :-122.5 2015-08-21 16:54:47.397989:3974
## 1st Qu.:37.75 1st Qu.:-122.4 2015-08-21 16:56:54.438494: 1
## Median :37.77 Median :-122.4 2015-08-21 16:56:54.449790: 1
## Mean :37.77 Mean :-122.4 2015-08-21 16:56:54.455704: 1
## 3rd Qu.:37.78 3rd Qu.:-122.4 2015-08-21 16:56:54.458357: 1
## Max. :37.81 Max. :-122.4 2015-08-21 16:56:54.460898: 1
## (Other) :1161
## date_collected price_per_bedroom
## Min. :2015-08-21 Min. : 10.0
## 1st Qu.:2015-08-21 1st Qu.: 100.0
## Median :2015-08-21 Median : 140.0
## Mean :2015-08-21 Mean : 364.3
## 3rd Qu.:2015-08-21 3rd Qu.: 199.0
## Max. :2015-08-21 Max. :28000.0
## NA's :358
Here I’m mostly interested in comparing quantiles for price and price_per_bedroom. These variables are not drastically different between the 2015-08-21 records and all records.
The boxplots indicate that there are more outliers above price 1000 for 2015-08-21 compared to the other dates.
Looking at the three bar charts above, we see that the August 2015 date does not have exceptionally many records priced above the 95th percentile of the whole dataset. The August 2015 records do, however, account for a very large share of the top 2% of prices in the dataset, and the date has almost three times as many of the top 1% of prices as any other date. Since the charts show absolute counts, and some of the later dates have more records than the August 2015 date, this becomes even more significant.
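The counts behind bar charts of this kind could be computed as follows. This is a sketch on a made-up `listings` data frame; the key point is that the thresholds come from the whole dataset and are then applied to each collection date.

```r
# Toy data (made-up values): two collection dates, five listings each
listings <- data.frame(
  date_collected = as.Date(c(rep("2015-08-21", 5), rep("2016-02-17", 5))),
  price = c(100, 150, 200, 5000, 9000, 90, 120, 160, 210, 700)
)

# Thresholds are computed once on the whole dataset, not per date
thresholds <- quantile(listings$price, probs = c(0.95, 0.98, 0.99), na.rm = TRUE)

# For each threshold, count the records above it on each date
# (rows are dates, columns are thresholds)
counts_above <- sapply(thresholds, function(threshold) {
  tapply(listings$price > threshold, listings$date_collected, sum)
})
print(counts_above)
```

Dividing each row by the number of records on that date would turn the absolute counts into shares, which makes dates with different record counts directly comparable.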
Judging from the above plot, the August_2015 price followed the general price trend up to about the 90th percentile. Let’s see if we can get some more details by applying log transformation to the plot.
We get a little more detail from the log transformation, but the trend stays the same: the August 2015 price follows the general trend until about the 90th percentile, where it starts to become much higher than the same percentiles in the whole dataset. I will plot this one last time, zooming in on the top 8 percentiles.
Compared to August_2015, the 92nd to 99th percentiles for all_records are very flat. From the 95th percentile, the August_2015 values are many times larger than the values for all_records, only being surpassed at the 100th percentile.
With the most extreme outliers removed, the August 2015 mean price is much closer to the mean of the other dates.
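A percentile comparison like this one can be built by evaluating `quantile()` on both groups over the same grid of probabilities. The sketch below uses made-up price vectors; `subset_prices` stands in for a single collection date and is not real data.

```r
# Toy prices (made-up values)
all_prices <- c(60, 80, 100, 120, 150, 170, 200, 250, 300, 400, 600, 1000)
subset_prices <- c(70, 90, 110, 160, 180, 260, 5000, 9000)

# Same probability grid for both groups
probs <- seq(0.5, 1, by = 0.05)
percentiles <- data.frame(
  percentile = 100 * probs,
  all_records = unname(quantile(all_prices, probs)),
  subset = unname(quantile(subset_prices, probs))
)

# The ratio shows where the subset's upper tail departs from the overall curve
percentiles$ratio <- percentiles$subset / percentiles$all_records
print(percentiles)
```

Plotting `all_records` and `subset` against `percentile`, optionally on a log scale, gives the kind of curve comparison discussed above.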
All of the neighborhoods have a median overall_satisfaction score of either 4.5 or 5. The mean score varies a little more, although all but two neighborhoods score 4 or better.
Based on the above plot it looks like price has a large impact on the overall_satisfaction score. There’s also a rather large gap between the median and mean prices, especially for overall_satisfaction score 1.
Keep in mind that only 1.5% of the records have an overall_satisfaction score between 1 and 3.5, and 7.5% of the records have a score of 0. It is therefore most interesting to look at the price differences in the 4 to 5 overall_satisfaction range, where there’s a fairly clear trend: pricier units receive better scores.
The plot showing mean overall_satisfaction over time has a strange dip from the end of 2016 until the end of the date range. I suspect this has something to do with the remaining 0 values in the dataset. I’ll take a closer look at those next.
Not a single record before 2016-12-23 has an overall_satisfaction score of 0. Since there is a small chance Airbnb actually allowed 0 overall_satisfaction scores starting late 2016, I won’t do anything further with these values. I will instead avoid using this variable in the rest of my analysis.
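Finding the first collection date on which a score of 0 appears takes only a filter and a `min()`. This is a minimal sketch on a made-up `ratings` data frame:

```r
# Toy data (made-up values)
ratings <- data.frame(
  date_collected = as.Date(c("2016-10-19", "2016-12-23", "2016-12-23", "2017-01-14")),
  overall_satisfaction = c(4.5, 0, 5, 0)
)

# Keep only the records with a score of exactly 0 (subset() also drops NA scores)
zero_scores <- subset(ratings, overall_satisfaction == 0)

# Earliest collection date with a zero score
first_zero_date <- min(zero_scores$date_collected)
print(first_zero_date)

# Number of zero scores per collection date
table(zero_scores$date_collected)
```

A sharp start date like this is more consistent with a change in how the data was recorded than with a gradual shift in guest ratings, although the data alone cannot confirm either explanation.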
Unsurprisingly, units with more bedrooms cost more. The same trend is true for the accommodates variable: the more people a unit can accommodate, the pricier the unit tends to be.
Due to the number of dates, this is not very easy to read. Let’s make a line chart instead.
With the line chart we see a clearer picture. The proportion of Entire home/apt units has stayed fairly consistent throughout the time period, while private room units have increased and shared room units have decreased. I also notice that a few dates at the end of 2015 and in the beginning of 2016 lack room_type information.
## # A tibble: 4 x 2
## date_collected n
## <date> <int>
## 1 2015-10-21 36
## 2 2015-11-21 1
## 3 2015-12-14 3
## 4 2016-01-16 3
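The per-date room type proportions and the count of records without a room type can both be derived from simple tables. The sketch below uses a made-up `listings` data frame in which a missing room type is stored as an empty string, as in the factor levels shown earlier.

```r
# Toy data (made-up values); "" marks a missing room type
listings <- data.frame(
  date_collected = c(rep("2015-10-21", 4), rep("2017-07-10", 4)),
  room_type = c("Entire home/apt", "Private room", "", "",
                "Entire home/apt", "Entire home/apt", "Private room", "Private room")
)

# Rows are dates, columns are room types; margin = 1 makes each row sum to 1
room_type_share <- prop.table(
  table(listings$date_collected, listings$room_type), margin = 1
)
print(round(room_type_share, 2))

# Number of records with a missing room type on each affected date
missing_by_date <- table(listings$date_collected[listings$room_type == ""])
print(missing_by_date)
```

Normalizing within each date is what makes the shares comparable over time, since the number of listings per date varies considerably.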
It is reassuring to see that both of the two room types Private room and Shared room have a great majority of 1 bedroom units.
There is a fairly clear relationship between price and neighborhood. This makes sense, as some neighborhoods are more desirable than others. Based on local knowledge about San Francisco, it seems like the more central neighborhoods have more expensive units.
Except for some outlier price points in August 2015, the mean and median prices did not vary much over the 4 year period. The prices at the last date of data collection were actually lower than at most other dates.
When it comes to overall_satisfaction, there’s a fairly clear trend for the higher scores: more expensive units receive a higher score. Be aware that the overall_satisfaction data is rather fragile, and I had to exclude about 30% of the records from overall_satisfaction analysis due to null values.
There appears to be a relationship between neighborhood and overall satisfaction. However, the correlation is not very strong, as the mean scores of all but 3 neighborhoods fall within 0.6 points of each other.
Less interesting but worth mentioning is that there is a clear correlation between price and both the number of bedrooms and the number of people a unit can accommodate. Considering that both of these variables function as proxies for unit size, this indicates the following finding: larger units cost more.
There are clear correlations between price and the number of bedrooms, and between price and the number of people a unit can accommodate. Both accommodation count and number of bedrooms function as proxies for unit size, and indicate the following finding: larger units cost more.
Entire home/apt units tend to have a higher price than private rooms, while shared rooms have the lowest prices of the three room_type categories. This makes sense, as entire homes tend to be larger than private rooms. While less clear, entire homes also tend to have a higher per bedroom price.
The prices of entire home units have increased somewhat since the early days. Private rooms have become a little bit cheaper, while prices for shared rooms, excluding one date in August 2015, have been fairly stable.
In the plots above I have excluded the units with 5 or more bedrooms (top 1%). With the exception of 4-bedroom units, the median price for all bedroom counts has stayed stable over time. The mean price is affected by the large price outliers in August 2015. Excluding that date, the mean price for 0-2 bedroom units has stayed stable. The 3-bedroom units had a mean price increase from 2013 until early 2016, and have since seen a slow mean price decrease of about $100. The mean price for 4-bedroom units saw a sharp increase until early 2016, when it stabilized around $800.
For room types Entire home/apt and Private room, the median price trend of neighborhoods seems to match the overall trend, with a few exceptions. The price of shared rooms varies much more across neighborhoods, which I suspect is partially due to a low number of data points in some neighborhoods.
Looking at this, Presidio stands out. While price per bedroom is fairly stable throughout the time period, the unit price varies drastically. Let’s take a closer look at Presidio to figure out what is going on.
Looking at the summary statistics, there seems to be a clear problem with drawing any conclusions from the Presidio data: the sample size per date is just too small, with most dates having fewer than 10 records. Let’s take a look at the sample sizes for all neighborhoods.
## # A tibble: 37 x 2
## neighborhood n
## <fctr> <int>
## 1 Presidio 155
## 2 Treasure Island/YBI 262
## 3 Golden Gate Park 354
## 4 Seacliff 429
## 5 Diamond Heights 446
## 6 Crocker Amazon 616
## 7 Visitacion Valley 735
## 8 Presidio Heights 1039
## 9 Lakeshore 1278
## 10 Glen Park 1889
## # ... with 27 more rows
Presidio is clearly the most troublesome neighborhood in this regard, having about 40% fewer records than the neighborhood with the second fewest. However, some of the other neighborhoods might also have too few records to be statistically viable. Let’s take a closer look at the neighborhoods with fewer than 1000 total records.
To some extent all of these are troubling. With the exception of Crocker Amazon, all these neighborhoods have some dates with fewer than 10 records. However, only the Presidio has consistently around 10 or fewer records per date over time.
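The per-date sample sizes can be checked with a two-way table of neighborhood against collection date. This is a minimal sketch on a made-up `listings` data frame; the threshold of 10 records comes from the text above, while the neighborhood names and counts are invented.

```r
# Toy data (made-up values): "Small" has only 3 records on one date
listings <- data.frame(
  neighborhood = c(rep("Big", 24), rep("Small", 13)),
  date_collected = c(rep("2016-02-17", 12), rep("2016-06-18", 12),
                     rep("2016-02-17", 10), rep("2016-06-18", 3))
)

min_records <- 10  # minimum number of records per neighborhood and date

# Rows are neighborhoods, columns are dates; combinations with no records
# show up as 0, which a grouped count would silently omit
counts <- table(listings$neighborhood, listings$date_collected)
print(counts)

# Neighborhoods that fall below the threshold on at least one date
low_record_neighborhoods <- rownames(counts)[rowSums(counts < min_records) > 0]
print(low_record_neighborhoods)
```

The resulting vector can then be used to filter the low-record neighborhoods out before plotting per-neighborhood trends.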
Excluding the low-record neighborhoods shows a clearer relationship between median price and median price per bedroom for all neighborhoods. I also notice that except for the first date, which we know has many fewer records compared to the other dates, the median price per bedroom within neighborhoods stayed fairly stable throughout the time period. The same is true for median price, except for Presidio Heights, which had some peaks on certain dates.
It looks to me like the most affordable units are widely distributed throughout all the neighborhoods, while the most expensive units can mostly be found in the most central parts of the city.
Neighborhood and price have a clear correlation, even after including time as a variable. Both median and mean prices vary a fair amount from neighborhood to neighborhood, and prices within neighborhoods stayed fairly stable throughout the date range.
I found it interesting how a relatively large dataset can still be too small for detailed analysis when you include multiple variables. The clearest example of this was how I had to exclude certain neighborhoods from my analysis due to having too few data points (fewer than 10) on certain dates.
In this plot the max price is set to 1000, which excludes the 2% most expensive units. Even when the large outliers are removed, the price data is still largely right-skewed.
All the neighborhoods have a higher mean than median. Some neighborhoods have a drastically higher mean than median, indicating right-skewed data or an issue with outliers (a large number of outliers, some very large outliers, or both).
The first price quartile has a much wider distribution compared to the rest. For the fourth price quartile the units are densely located in the more central parts of the city.
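Assigning each listing to a price quartile, as a map split by quartile requires, can be done with `quantile()` and `cut()`. This is a minimal sketch on a made-up `price` vector; the method used for the original plot is not shown in this report, and the labels `Q1` to `Q4` are hypothetical.

```r
# Toy prices (made-up values)
price <- c(50, 80, 100, 120, 150, 180, 250, 400)

# Quartile boundaries: minimum, 25th, 50th and 75th percentiles, maximum
breaks <- quantile(price, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# Bin each price; include.lowest = TRUE keeps the minimum price in Q1.
# Note: cut() fails if two boundaries coincide, which can happen with
# heavily tied prices.
price_quartile <- cut(price, breaks = breaks, include.lowest = TRUE,
                      labels = c("Q1", "Q2", "Q3", "Q4"))
table(price_quartile)
```

The resulting factor can be used as a faceting or coloring variable when plotting listings by latitude and longitude.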
Exploring the Airbnb data from San Francisco has been an interesting endeavor. Although I do not think I have discovered any dramatic new findings, there are plenty of interesting observations to be made. As an example, I find it interesting how price has stayed fairly stagnant throughout the four years, while the number of rental units on the market has increased dramatically.
Since I chose to find a dataset on my own I had to be extra wary of potential data quality issues. Several times throughout my exploratory analysis I came across suspicious values, dips and peaks. This led to lengthy examinations, but in the end I felt more confident about what parts of the data were trustworthy, and which variables were too uncertain to continue focusing on.
The clearest relationships in the dataset are also the most obvious ones, namely the correlation between price and the proxies for rental unit size (bedroom count and how many people the unit can accommodate). There is also a clear correlation between price and neighborhood.
I have intentionally chosen not to look at changes for host_id and room_id over time, as I have considered it to be outside the scope of this project. It would be interesting to find out how ownership changes over time, and to see if any of the variables for individual housing units ever change.